Large Scale Cross-Correlations in Internet Traffic 
The Internet is a complex network of interconnected routers and the existence of collective
behavior such as congestion suggests that the correlations between different connections play a crucial
role. It is thus critical to measure and quantify these correlations. We use methods of random 
matrix theory (RMT) to analyze the cross-correlation matrix C of information flow changes of 650 
connections between 26 routers of the French scientific network 'Renater'. We find that C has the 
universal properties of the Gaussian orthogonal ensemble of random matrices: the distribution of
eigenvalues (up to a rescaling which exhibits a typical correlation time of the order of 10 minutes) and
the spacing distribution follow the predictions of RMT. There are some deviations for large eigenvalues,
which contain network-specific information and which identify genuine correlations between
connections. The study of the most correlated connections reveals the existence of 'active centers'
which are exchanging information with a large number of routers thereby inducing correlations be- 
tween the corresponding connections. These strong correlations could be a reason for the observed 
self-similarity in the WWW traffic. 

PACS numbers: 02.50.-r, 05.45.Tp, 84.40.Ua, 87.23.Ge



I. INTRODUCTION 



The Internet connects different routers and servers using
different operating systems and transport protocols. This
intrinsic heterogeneity of the network, added to the
unpredictability of human practices, makes the Internet
inherently unreliable and its traffic complex. Recently,
there have been major advances in our understanding
of the generic aspects of the Internet and
web structure and development, revealing
that these networks can exhibit emergent collective
behavior characterized by scaling. Concerning data
transport, most studies focus on properties at short
time scales (usually < 1 min) or at the level of individual
connections [17]. In particular, it has been shown
that self-similarity (for time correlations) applies to
wide- and local-area networks. Possible reasons for this
behavior were shown to be [17] the underlying distribution
of WWW documents, the effects of user 'think time',
and the addition of many such transfers.

Studies on statistical flow properties at a large scale
concentrate essentially on the phase transition
from a 'fluid' regime to a 'congested' one for which the
average packet travel time is very large. The existence
of such collective behavior indicates the importance of
spatial correlations between connections at a large scale
in the system. In order to understand and
to model the traffic in the network, it is thus important
to measure and to quantify the correlations between the
flows in different connections.

In this paper, we analyze the correlations between different
connections of a wide-area network, namely the
French scientific network 'Renater'. We use random matrix
theory (RMT) to study the corresponding empirical
correlation matrix. RMT was developed in the fifties
for studying the complex energy levels of heavy nuclei [21]
and more recently it has also been used in the study of
correlations of stocks [22,23] and of the statistics of atmospheric
correlations [25].

We first demonstrate the validity of the universal pre- 
dictions of RMT for the eigenvalue statistics of the cross- 
correlation matrix. However, we observe some deviations 
compared to the minimal hypothesis of random indepen- 
dent time-series. These deviations from the universal 
predictions of RMT identify system-specific, non-random 
properties of the network providing clues about the na- 
ture of the underlying interactions. This result allows one 
to distinguish genuine correlations in the network which 
are not just due to noise. 



II. EMPIRICAL RESULTS 

A. Data studied 

We use data from the French network 'Renater' [26],
which has about 2 million users and which consists of
about 30 interconnected routers (Fig. 1). Most research
institutes and technological or educational institutions are
connected to Renater.

The data consist of the real exchange flow (sum of FTP,
Telnet, Mail, Web browsing, etc.) between all routers,
even if there is not a direct (physical) link between all
of them. For a connection $(i,j)$ between routers $i$ and $j$
($i \neq j$), $F_{ij}(t)$ (in bytes per 5 minutes) is the effective
information flow at time $t$ going out from $i$ to $j$ (the flow
going from $i$ to $k$ via $j$ is excluded from $F_{ij}$). For technical
reasons, data for a few routers were not reliable and
we analyzed data for 26 routers, which amounts to $26 \times 26$
matrices $F_{ij}(t)$ given at every sampling time scale $\tau = 5$
minutes during a two-week period. We also exclude from
the present study the internal flow $F_{ii}$ and the nights, for
which the flow is essentially due to machine activity. We
thus studied data for days (8am-6pm), which amounts
to a total of $N = 26 \times 25 = 650$ different connections
given for $L = 12 \times 10 \times 14\,\mathrm{days} = 1680$ time counts. We
choose as a measure of the magnitude of the time-series
fluctuations the growth rate defined as the logarithm of
the ratio of successive counts



$$ g_{ij}(t) = \log\left[\frac{F_{ij}(t+\tau)}{F_{ij}(t)}\right] \qquad (1) $$



for $t = 0, \ldots, (L-1)\tau$. This measure has several nice
properties. First, any multiplicative, time-independent
sample bias cancels in the ratio. Second, this measure has
a natural interpretation in terms of relative growth since,
for a small increase, $g_{ij}(t) \simeq [F_{ij}(t+\tau) - F_{ij}(t)]/F_{ij}(t)$ is
simply the relative increment. A large value of this quantity
reflects a large activity (i.e. a large flow variation),
while a small value corresponds to an almost constant
flow. This measure is thus independent of the volume
of information exchanged and therefore does not eliminate
the 'small' routers. The study of volume flow exchange
will be published elsewhere [32]; in the present paper
the quantity $g$ allows us to study more subtle effects
such as the activity of a regional router, independently
of its 'size' measured in terms of exchanged information
volume.
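
As a minimal sketch of the growth-rate measure of Equ. (1), the following Python fragment computes $g$ for a single connection. The flow values are hypothetical and chosen only for illustration, and the function name `growth_rates` is ours; strictly positive counts are assumed, since the logarithm is undefined for a vanishing flow.

```python
import math

def growth_rates(flow):
    """Return g(t) = log(F(t + tau) / F(t)) for successive counts of one connection."""
    return [math.log(nxt / cur) for cur, nxt in zip(flow, flow[1:])]

# Hypothetical 5-minute byte counts for one connection (illustrative values only).
flow = [1000.0, 1100.0, 1100.0, 990.0]
g = growth_rates(flow)

# A multiplicative, time-independent bias cancels in the ratio.
g_biased = growth_rates([3.0 * f for f in flow])

# Number of daytime samples: 12 per hour x 10 hours x 14 days.
L = 12 * 10 * 14
```

The biased and unbiased series give the same growth rates, which illustrates the first property quoted above, and an unchanged count gives $g = 0$.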



B. Correlation matrix 

The simplest measure of correlations between different
connections $(i,j)$ and $(k,l)$ is the equal-time
cross-correlation matrix $C$, which has elements



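$$ C_{(ij)(kl)} = \frac{\langle g_{ij}\, g_{kl} \rangle - \langle g_{ij} \rangle \langle g_{kl} \rangle}{\sigma_{ij}\, \sigma_{kl}} \qquad (2) $$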



where $\sigma_{ij} = \sqrt{\langle g_{ij}^2 \rangle - \langle g_{ij} \rangle^2}$ is the standard deviation of
the flow growth rate of the connection $(i,j)$ and $\langle \cdots \rangle$
denotes a time average over the period studied. The correlation
matrix is real symmetric and its elements lie
between $-1$ (anti-correlated connections) and
$1$ (correlated connections), while a null value denotes the
absence of linear correlation.
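
The definition of Equ. (2) can be sketched in a few lines of Python for a single pair of connections. The series below are hypothetical, and the helper names `mean` and `cross_correlation` are introduced here for illustration only.

```python
import math

def mean(xs):
    """Time average <...> over the period studied."""
    return sum(xs) / len(xs)

def cross_correlation(g1, g2):
    """Equal-time correlation coefficient of two growth-rate series, as in Equ. (2)."""
    m1, m2 = mean(g1), mean(g2)
    cov = mean([a * b for a, b in zip(g1, g2)]) - m1 * m2
    s1 = math.sqrt(mean([a * a for a in g1]) - m1 * m1)
    s2 = math.sqrt(mean([b * b for b in g2]) - m2 * m2)
    return cov / (s1 * s2)

# Hypothetical growth rates (illustrative values only).
g_a = [0.10, -0.20, 0.05, 0.30, -0.15]
g_b = [2.0 * x + 0.01 for x in g_a]   # follows g_a up to scale and offset
g_c = [-x for x in g_a]               # mirrors g_a

c_ab = cross_correlation(g_a, g_b)    # close to +1
c_ac = cross_correlation(g_a, g_c)    # close to -1
```

Because each series is divided by its own standard deviation, rescaling a connection's growth rate does not change its correlation coefficients.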

The quantities $g_{ij}/\sigma_{ij}$ have (by construction) a variance
equal to one and a zero mean (for a sufficiently long
time). It is thus natural to compare our empirical results
with mutually independent time series (the 'null'
hypothesis) described by the correlation matrix



$$ R = \frac{1}{L}\, A A^{T} \qquad (3) $$



where $A$ (the so-called random Wishart matrix) is an
$N \times L$ matrix containing $N$ time series of $L$ random
independent elements with zero mean and unit variance ($A^{T}$
denotes the transpose of $A$). Each element of $R$ can be
written as $R_{(ij)(kl)} = \langle a_{ij}\, a_{kl} \rangle$, where $a_{ij}(t)$ is a time series
of independent elements with zero mean ($\langle a_{ij} \rangle = 0$) and
unit variance ($\sigma_{ij} = 1$).
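
The null hypothesis of Equ. (3) is easy to generate numerically. The sketch below assumes Gaussian entries (the text only requires zero mean and unit variance) and uses sizes smaller than the $N = 650$, $L = 1680$ of the data to keep it fast; the function name and the seed are ours. The bounds used for comparison are those given by Equ. (5).

```python
import numpy as np

def null_correlation_eigenvalues(n_series, length, seed=0):
    """Eigenvalues of R = A A^T / L for independent series (Gaussian entries assumed)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_series, length))   # N series of L independent elements
    R = A @ A.T / length
    return np.linalg.eigvalsh(R)                  # real symmetric matrix, ascending order

N, L = 100, 400          # illustrative sizes, not those of the Renater data
Q = L / N
lam_minus = 1 + 1 / Q - 2 / np.sqrt(Q)
lam_plus = 1 + 1 / Q + 2 / np.sqrt(Q)

eig = null_correlation_eigenvalues(N, L)
```

For finite $N$ and $L$ the extreme eigenvalues fluctuate slightly around the asymptotic edges, so the comparison with $[\lambda_-, \lambda_+]$ is only approximate.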



1. Eigenvalues 

The probability distribution of the elements of $C$ shows
that most of the elements are positive (Fig. 2), which
indicates a strong correlation across the whole network.
For comparison, the elements of $R$ are distributed according
to a centered distribution with zero mean. We now
study the statistical properties of $C$ by applying RMT
techniques. We first diagonalize $C$ and obtain its eigenvalues
$\lambda_k$ ($k = 1, \ldots, N$), which we sort from the largest
to the smallest. We then calculate the eigenvalue distribution
and compare it with the analytical result for
a cross-correlation matrix generated from finite uncorrelated
time series [27] in the limit $N \to \infty$, $L \to \infty$ where
$Q = L/N > 1$ is fixed



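$$ P_{\rm rm}(\lambda) = \frac{Q}{2\pi}\, \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\lambda} \qquad (4) $$

with $\lambda \in [\lambda_-, \lambda_+]$ and where

$$ \lambda_\pm(Q) = 1 + 1/Q \pm 2/\sqrt{Q} \qquad (5) $$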



The eigenvalue distribution of $C$ is very different from
Equ. (4), which predicts a finite range of eigenvalues depending
on the ratio $Q$. The theoretical value is $Q = L/N \simeq 2.58$,
but we can reasonably fit the empirical curve only with an
effective value $Q^* \simeq 1.1$ (Fig. 3a). This effective value can
be explained as resulting from time correlations in the
traffic of the order of $\frac{Q}{Q^*} \times \tau \simeq 11$ minutes. However, even
this fit cannot reproduce the large eigenvalues observed:
for $Q^* \simeq 1.1$ the theoretical eigenvalues are distributed
in the interval $2.17 \times 10^{-3} < \lambda_k < 3.82$, while a few
measured eigenvalues (a total of order 20, not all shown on
the graph) are found above $\lambda_+(Q^*) = 3.82$. The largest
eigenvalue is of order $\lambda_1 \simeq 200$, namely approximately
a hundred times larger than the maximum eigenvalue predicted
for uncorrelated time series. As we will see, the
empirical distribution of eigenvector components for the
large eigenvalues is 'flat', all components being of the
same order. This suggests that the largest eigenvalues
are associated with strong correlations across the network.
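
The numbers quoted above follow directly from Equ. (4) and Equ. (5). As a short check, the sketch below evaluates the band edges for the effective value $Q^* = 1.1$ and the ratio $Q = L/N$ of the data; the function names are ours.

```python
import math

def mp_bounds(Q):
    """Band edges lambda_-(Q), lambda_+(Q) of Equ. (5)."""
    lam_minus = 1 + 1 / Q - 2 / math.sqrt(Q)
    lam_plus = 1 + 1 / Q + 2 / math.sqrt(Q)
    return lam_minus, lam_plus

def mp_density(lam, Q):
    """Eigenvalue density of Equ. (4); zero outside [lambda_-, lambda_+]."""
    lo, hi = mp_bounds(Q)
    if not lo <= lam <= hi:
        return 0.0
    return Q / (2 * math.pi) * math.sqrt((hi - lam) * (lam - lo)) / lam

Q_theory = 1680 / 650              # L / N for the data studied
lo_eff, hi_eff = mp_bounds(1.1)    # band edges for the effective Q* = 1.1
```

With these definitions `Q_theory` is close to 2.58 and the band for $Q^* = 1.1$ runs from about $2.17 \times 10^{-3}$ to about $3.82$, so an eigenvalue of order 200 lies far outside the range allowed for uncorrelated series.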

We also calculate the distribution of the nearest-neighbor
spacings $s = \lambda_{k+1} - \lambda_k$. We compare the empirical
distribution of nearest-neighbor spacings with the
RMT predictions for real symmetric random matrices.
This class of matrices shares universal properties with
the ensemble of matrices whose elements are distributed
according to a Gaussian probability measure, the Gaussian
orthogonal ensemble (GOE). We find good agreement
(Fig. 3b) between the empirical data and Wigner's
surmise

$$ P_{\rm GOE}(s) = \frac{\pi s}{2} \exp\left(-\frac{\pi}{4} s^2\right) \qquad (6) $$

which indicates a 'level repulsion' existing in our system
and means that the eigenvalues are correlated.
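
The two properties of Wigner's surmise used here, level repulsion and unit mean spacing, can be checked numerically. The sketch below uses a plain midpoint rule; the helper `integrate` and the cutoff at $s = 10$ are our own choices.

```python
import math

def wigner_surmise(s):
    """GOE nearest-neighbor spacing density of Equ. (6)."""
    return (math.pi * s / 2) * math.exp(-math.pi * s * s / 4)

def integrate(f, a, b, n=100000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

norm = integrate(wigner_surmise, 0.0, 10.0)                        # total probability
mean_spacing = integrate(lambda s: s * wigner_surmise(s), 0.0, 10.0)
p_zero = wigner_surmise(0.0)                                       # level repulsion
```

The density vanishes at $s = 0$, which is the level repulsion mentioned above, and its mean is one, which is why the spacings have to be rescaled to unit mean (unfolded) before the comparison.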

2. Eigenvectors and Inverse Participation Ratio 

We now analyze the eigenvectors of $C$. We denote by
$u_k$ the eigenvector associated with the eigenvalue $\lambda_k$. If
we normalize the eigenvectors such that $u_k^2 = N$, it can
be shown that in the Wishart case the components $u$ of
the eigenvectors are distributed according to the so-called
Porter-Thomas distribution

$$ P(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} \qquad (7) $$

In agreement with this result, we find that eigenvectors
corresponding to most eigenvalues in the 'bulk' of
the spectrum ($\lambda_k$ not too large) follow this prediction
(Fig. 4a).

On the other hand, eigenvectors with eigenvalues outside
the bulk ($\lambda_k > \lambda_+(Q^*)$) show marked deviations
from the Gaussian distribution (Fig. 4b, c). In particular,
the vector corresponding to the largest eigenvalue
$\lambda_1$ deviates significantly from the Gaussian distribution
predicted by RMT (Fig. 4b). This eigenvector is the
signature of a collective behavior, that of the network itself,
for which all connections are correlated. This effect was
already observed in the framework of stock correlations,
the largest eigenvalue corresponding in this case to the entire market.

The distribution of the components of an eigenvector
contains information about the number of connections
contributing to it. In order to distinguish between one
eigenvector with approximately equal components and
another with a small number of large components we use
the inverse participation ratio (IPR) introduced in the
context of localization theory [29,30]



$$ I_k = \sum_{i=1}^{N} [u_{ki}]^4 \qquad (8) $$



where $u_{ki}$, $i = 1, \ldots, N = 650$, are the components of
eigenvector $u_k$. When the components of a vector are of
the same order and distributed according to Equ. (7), the
average IPR is small and equal to $3/N$, whereas a vector
with only a few non-zero components leads to an IPR of
order unity. The quantity $T_k = 3/I_k$ is thus a measure
of the number of vector components significantly different
from zero. We compared $T_k$ for our empirical results
and for uncorrelated time series with the same values of
$(N, L)$ (Fig. 5). For the latter case, $T_k$ has small fluctuations
around $N = 650$, indicating that all the vectors are
extended [30], which means that almost all connections
contribute to them. On the other hand, the empirical
data show deviations of $T_k$ from $N$ for the smallest and
largest eigenvalues (except for the largest eigenvalue). In
these cases, the number of contributing connections is
much smaller than $N$, ranging from a few connections to
a few hundred. These deviations of a few orders of magnitude
of $I_k$ from its average suggest that the vectors are
localized [30] and that only a few connections contribute
to them. As will be illustrated with a simple example
in the next section, these results have a clear meaning in
the case of large eigenvalues, for which the connections are
correlated. In addition, it was also shown ([24] and see
below) that strongly correlated pairs of routers (which
correspond to large components in the eigenvectors) also
appear with a relative negative sign in the eigenvectors
for small eigenvalues. This explains why the lower band
edge also displays localized vectors, but there is no clear
connection with the spectrum observed in localization in
electronic systems [30].
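
The two limiting cases of the IPR discussed above can be illustrated with a short sketch. It assumes that the IPR of Equ. (8) is evaluated for a vector rescaled to unit Euclidean norm, which is the normalization for which Gaussian components give an average of about $3/N$; the function names and the random seed are ours.

```python
import random

def inverse_participation_ratio(vec):
    """IPR of Equ. (8) for a vector rescaled to unit Euclidean norm (assumption)."""
    norm2 = sum(x * x for x in vec)
    return sum((x * x / norm2) ** 2 for x in vec)

def participation_number(vec):
    """T = 3 / I, the measure of the number of significant components."""
    return 3.0 / inverse_participation_ratio(vec)

N = 650
rng = random.Random(0)                                  # fixed seed, illustration only
extended = [rng.gauss(0.0, 1.0) for _ in range(N)]      # Gaussian components
localized = [1.0] + [0.0] * (N - 1)                     # a single non-zero component

T_extended = participation_number(extended)     # fluctuates around N
T_localized = participation_number(localized)   # I = 1, hence T = 3
```

An extended vector gives a value of $T$ of order $N$, while a vector concentrated on a handful of connections gives a value of order one, which is the distinction drawn in Fig. 5.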

In addition, our empirical results exhibit 'quasi-extended'
states in the center of the band. These states
consist essentially of a group of about 300 to 400 connections
corresponding to eigenvalues of order 0.2 to 0.4.

The physical picture which emerges is thus the following.
The largest eigenvalue has an eigenvector for which
$T_{k=1}$ is of order $N$ and which thus represents the whole
network. The eigenvectors which correspond to eigenvalues
deviating from pure random matrix theory
correspond to genuine correlations in the network. We have
shown that these 'deviating' eigenvectors (of the order
of 20) have a small value of $T_k$, which means that these
important correlations are localized and that a relatively
small number of connections concentrate most of the
activity.

3. Non-Universal Properties: Active Centers

The details of the components of the 'deviating' eigenvectors
give us information about the important correlations
in the network. In particular, the largest components
of the eigenvectors correspond to the most correlated
connections. This can be seen in the following
simple example of a $3 \times 3$ correlation matrix



$$ C = \begin{pmatrix} 1 & c & 0 \\ c & 1 & c' \\ 0 & c' & 1 \end{pmatrix} \qquad (9) $$



where $c$ (resp. $c'$) denotes the strength of the (1,2) (resp.
(2,3)) correlation. If we denote the ratio of the correlation
strengths by $\eta = c'/c$, the eigenvectors $u_1$, $u_2$, and $u_3$
are respectively




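$$ u_1 \propto \begin{pmatrix} 1 \\ \sqrt{1+\eta^2} \\ \eta \end{pmatrix}, \quad u_2 \propto \begin{pmatrix} -\eta \\ 0 \\ 1 \end{pmatrix}, \quad u_3 \propto \begin{pmatrix} 1 \\ -\sqrt{1+\eta^2} \\ \eta \end{pmatrix} \qquad (10) $$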



and correspond respectively to the eigenvalues (sorted in 
decreasing order) 



$$ 1 + c\sqrt{1+\eta^2}, \quad 1, \quad 1 - c\sqrt{1+\eta^2} \qquad (11) $$



We thus see in this simple example that the components
of the eigenvector $u_1$ (corresponding to the largest eigenvalue)
identify the most correlated indices: for $\eta \ll 1$,
$u_1 \simeq (1, 1, 0)$ and for $\eta \gg 1$ one obtains $u_1 \propto (0, 1, 1)$.

This remark shows that the eigenvectors are indeed important
for identifying the most correlated connections in
the network. We note that the large correlations are also
reflected in the components (but with a relative minus
sign) of the eigenvectors for small eigenvalues.
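
The toy example can be verified directly. The sketch below assumes the $3 \times 3$ matrix with unit diagonal, a (1,2) entry $c$, a (2,3) entry $c'$ and a vanishing (1,3) entry, and checks that the stated vectors satisfy $C u = \lambda u$; the numerical values of $c$ and $c'$ are arbitrary and the function names are ours.

```python
import math

def matvec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def toy_spectrum(c, c_prime):
    """Matrix of the 3 x 3 example with its (eigenvalue, unnormalized eigenvector) pairs."""
    eta = c_prime / c
    root = math.sqrt(1 + eta * eta)
    C = [[1, c, 0], [c, 1, c_prime], [0, c_prime, 1]]
    pairs = [
        (1 + c * root, [1, root, eta]),     # largest eigenvalue
        (1.0, [-eta, 0, 1]),
        (1 - c * root, [1, -root, eta]),    # smallest eigenvalue
    ]
    return C, pairs

def residual(C, lam, u):
    """Largest component of C u - lam u; zero for an exact eigenpair."""
    return max(abs(a - lam * b) for a, b in zip(matvec(C, u), u))

C_weak, pairs_weak = toy_spectrum(0.5, 0.05)      # eta = 0.1: (1,2) dominates
C_strong, pairs_strong = toy_spectrum(0.05, 0.5)  # eta = 10: (2,3) dominates
```

For $\eta = 0.1$ the third component of $u_1$ is ten times smaller than the first, while for $\eta = 10$ the first component is ten times smaller than the other two, in line with the limits quoted above.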

In the case of Renater, we have seen in the previous
section that all the components of $u_1$ are positive, which
indicates a correlation across the whole network. Even
if all the components of $u_1$ indicate correlations existing
in the network, the simple example above shows that
its largest components correspond to the most correlated
connections. We thus looked at the largest components
of $u_1$. A first fact is that a connection $(i,j)$ is always
(strongly) correlated with the connection $(j,i)$. This
result is not surprising since for most operations (Web
browsing, Telnet, etc.), there is always an 'outgoing' flow
which is a significant part of the 'incoming' flow.

In order to look for other causes of correlations we plot
in Fig. 6 the histogram of occurrences $h(i)$ of the router
$i$ in the set of the $n$ most correlated connections,
which are given by the $n$ largest components of the eigenvector
$u_1$ corresponding to the largest eigenvalue. We
compared the empirical results with the control case for
increasing values of $n$ (for $n$ approaching the total number
of components $N = 650$ all the connections appear
and the histogram of occurrences is flat). We observe
marked differences between these two cases. In particular,
in the control case the histogram tends to be uniform
while for Renater we observe persistent peaks. On the
last plot (Fig. 6c), it is apparent that there are still some
fluctuations in the control case but much less than in the
empirical one. The persistence of the peaks and the fact that
they appear to be much larger than the average value
suggest that it is very unlikely that they are just fluctuations
due to noise. Therefore, not all routers appear in
the most correlated connections and the peaks can thus
be identified as important 'active centers'. These centers
are exchanging information with many other routers,
thereby inducing correlations between these connections.
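
The construction of the histogram $h(i)$ can be sketched as follows. The toy network, the component values and the function name `router_occurrences` are hypothetical; ranking by absolute value is our assumption, made because the overall sign of an eigenvector is arbitrary.

```python
from collections import Counter

def router_occurrences(components, n):
    """Count how often each router appears in the n connections with the
    largest components; `components` maps a connection (i, j) to its component."""
    top = sorted(components, key=lambda conn: abs(components[conn]), reverse=True)[:n]
    counts = Counter()
    for i, j in top:
        counts[i] += 1   # emitting router
        counts[j] += 1   # receiving router
    return counts

# Hypothetical 4-router network in which router 1 plays the role of an active center.
routers = [1, 2, 3, 4]
components = {
    (i, j): (0.9 if 1 in (i, j) else 0.1)
    for i in routers for j in routers if i != j
}

h = router_occurrences(components, n=6)
average = 2 * 6 / len(routers)   # each connection contributes two occurrences
```

Each selected connection contributes two occurrences, so the average of the histogram is $2n$ divided by the number of routers, and an active center shows up as a count well above this average.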

It is interesting to note that occurrence peaks also ap- 
pear in the components of the other deviating eigenvec- 
tors and would thus also correspond to active centers but 
at a lower level of correlation. 

At this stage, we would like to emphasize that this
analysis highlights active centers independently of the volume
of information exchanged. Indeed, in a volume flow
analysis the 'small' routers, even if very active, are completely
hidden by the 'big' routers which are receiving
and emitting huge amounts of bytes.



III. CORRELATIONS AND SELF-SIMILARITY IN THE WWW

The Internet is an example of a complex network which
shows the existence of collective behavior such as a phase
transition to a congested regime. An important discovery
was also the power-law decay of time correlations.
This self-similarity is usually explained on the basis
of the underlying distributions of WWW document sizes, the
effect of user 'think time' and the addition of many such
effects in a network [17].

The present study shows that strong correlations between
different connections exist in the traffic network.
This result, together with the existence of a phase transition
and of a power-law decay of time correlations,
suggests that the large-scale data traffic dynamics
could be described by a set of simple coupled stochastic
differential equations, such as the Langevin equations with
random interactions [33]. The equation for the Internet
activity on a given connection $(i,j)$ would thus be



$$ \frac{dg_{ij}}{dt} = F(g_{ij}(t)) + \epsilon_{ij}(t) + \sum_{(k,l)} J_{(ij)(kl)}\, g_{kl}(t) \qquad (12) $$



where the function F is usually expanded for small g as 



$$ F(g) = -rg - ug^3 \qquad (13) $$



and describes the relaxation of a single isolated connection.
The random noise $\epsilon$ is associated with the effect of
users and the quantity $J_{(ij)(kl)}$ is the coupling between
connections $(ij)$ and $(kl)$. In the absence of interaction,
the correlation function $\langle g(t)\, g(t+\tau) \rangle$ decreases
exponentially with a typical correlation time of order $1/r$ (for
$u = 0$). When the coupling is strong enough, the system
described by Equ. (12) undergoes a transition to an ordered
state where all $g$'s are centered around a non-zero
value. At the transition point the correlation function
decays as a power law [34].
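
The exponential decay quoted for an isolated connection can be illustrated numerically. The sketch below integrates the uncoupled linear case ($J = 0$, $u = 0$) with the Euler-Maruyama scheme, under the assumption that the noise is Gaussian and white; the step size, the noise amplitude, the seed and the function names are our own illustrative choices and are not taken from the data.

```python
import math
import random

def simulate_isolated_connection(r, noise, dt, steps, seed=0):
    """Euler-Maruyama integration of dg/dt = -r g + eps(t), with eps assumed
    to be Gaussian white noise of amplitude `noise` (no coupling, u = 0)."""
    rng = random.Random(seed)
    g = 0.0
    path = []
    for _ in range(steps):
        g += -r * g * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(g)
    return path

def autocorrelation(path, lag):
    """Normalized estimate of <g(t) g(t + lag)> from a single trajectory."""
    n = len(path) - lag
    m = sum(path) / len(path)
    var = sum((x - m) ** 2 for x in path) / len(path)
    cov = sum((path[k] - m) * (path[k + lag] - m) for k in range(n)) / n
    return cov / var

r, dt = 1.0, 0.01
path = simulate_isolated_connection(r, noise=1.0, dt=dt, steps=400000)
lag_time = 1.0
rho = autocorrelation(path, round(lag_time / dt))   # compare with exp(-r * lag_time)
```

Up to sampling noise, the estimated correlation at a lag of $1/r$ is close to $e^{-1}$, i.e. the correlation time is of order $1/r$. Reproducing the power-law decay at the transition would require the coupled system, which this sketch does not attempt.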

In this model [Equ. (12)], the observed self-similarity
in time is a consequence of the strong correlations existing
in the network. This is in contrast with previous studies,
which explained the self-similarity as an effect of an existing
local power-law distribution (such as the file size
distribution). However, more data are needed for testing this
hypothesis and the validity of Equ. (12) for Internet
traffic.



IV. CONCLUSIONS 

In summary, the largest part of the correlation matrix
of connections is random, but the matrix also contains statistical
information distinct from pure noise. The eigenvectors
which correspond to eigenvalues outside of the RMT
predictions contain information about genuine traffic
correlations. In particular, the largest components of eigenvector
$u_1$ (which corresponds to the largest eigenvalue)
indicate the most correlated connections. We found different
origins for the observed correlations. First, a connection
$(i,j)$ is always strongly correlated with $(j,i)$, which is
expected since for each process (web browsing, for
example) information is exchanged in both directions.
Second, it appears that in the set of the strongly
correlated connections there is only a small number of
different routers which participate in different connections,
thereby inducing correlations. This supports the idea of
the existence of active centers which are either very
active or very visited. More work and data, on larger space
and time scales, are needed in order to understand more
thoroughly the existence of such centers, which seem to
play an important role in the network traffic.

The approach presented in this study thus seems to
allow one to extract relevant correlations between different
connections and might have potential applications to
traffic management and optimization. In particular, this
analysis focuses on activity independently of the volume
of information exchanged and can thus reveal some very
active routers which are usually hidden by 'big' routers
exchanging very large flows.

Finally, the existence of strong correlations, together
with the existence of a phase transition and of a power-law
decaying autocorrelation function, suggests that
Internet traffic is similar to a spin glass close to the critical
point. In this hypothesis, the self-similarity appears
naturally as the result of a collective behavior without
resorting to pre-existing power laws.

We thank F. Baccelli for stimulating and interesting
discussions. This work was supported by the Équipe
Réseaux, Savoirs & Territoires, École normale supérieure,
Paris.

FIG. 2. Probability distribution for the correlation coefficient
calculated from 5-minute flows in the Renater network
for a 14-day period. The average value is positive, indicating
strong correlations across the whole network.








FIG. 3. (a) The probability density of the eigenvalues of
the normalized cross-correlation matrix $C$ for the 650
connections for a two-week period. The results are reasonably
fitted by the analytical result obtained for cross-correlation
matrices generated from uncorrelated time series (solid line,
obtained from Equ. (4) with $Q^* = 1.1$). There are however
very large eigenvalues (not shown), the largest one being of
order 200. (b) Nearest-neighbor spacing distribution of the
eigenvalues of $C$ after unfolding using the Gaussian broadening
procedure [27]. The solid line is the RMT prediction for
the spacing distribution for the Gaussian orthogonal ensemble
(GOE).



FIG. 1. Map of the Renater network. There is a total
of about 30 interconnected routers (of which 26 are effectively
studied). We show on this map the physical connections.
The measured data consist of a flow matrix $F_{ij}(t)$
(with $t = \tau m$, $m = 0, \ldots, L-1$ and $i,j = 1, \ldots, 26$)
which gives the effective flow exchange between routers $i$
and $j$. For more details on this network, see the web page
http://www.renater.fr and for an animated version of flows,
see http://barthes.ens.fr/metrologie/Renater01.




FIG. 4. Eigenvector component distribution. (a) For eigenvalues
in the center of the spectrum. In this case, the empirical
results are in agreement with the result of RMT,
which is the Porter-Thomas distribution represented by a
solid line. (b,c) For large eigenvalues there is a clear deviation
compared to the RMT prediction represented by the solid
line (Porter-Thomas distribution). For the largest eigenvalue,
most of the components are non-zero and positive, which
indicates correlations across the whole network.



FIG. 5. Reciprocal inverse participation ratio for each of
the 650 eigenvectors (sorted for decreasing eigenvalues). As
a control case, we show the corresponding result for uncorrelated
independent time series of the same length as the
data. Empirical data show small values at both edges of the
spectrum, whereas the control shows only small fluctuations
around the average value $\langle 3/I_k \rangle = N = 650$.




[Figure 6 horizontal axis: router's number i]

FIG. 6. Number of occurrences of routers in the $n$
most correlated connections (there is a total of 26 routers
$i = 1, \ldots, 27$; router 24 is excluded from the present study
for technical reasons). In each plot, we compared the empirical
results with the control case (histogram in red). The
arrows indicate the two most frequent routers for Renater. In
cases (a) $n = 30$ and (b) $n = 50$, it is clear that not all routers
are participating equally. (c) Case $n = 100$. The control case
still fluctuates around its average (which is $200/26 \simeq 7.70$)
but much less than the empirical case. This fact and the
observed persistence for increasing $n$ suggest that it is very
unlikely that the empirical peaks are just fluctuations due to
noise. These peaks probably correspond to routers which are
very active and which are exchanging information with many
other routers, thereby inducing correlations in the network.